options(scipen=666)
library(data.table)
library(ggplot2)

Reading data in R

You can use fread to read csv, txt, or most other delimited file formats. In general fread is smart enough to detect how the data is delimited, so you don't need to pass a sep="," argument.

file_name = "prices_nasdaq.csv"
dt = fread(file_name)
dt

Some parameters of fread you might use eventually.

In general fread will just work 95% of the time, but in some cases you need to tweak a few parameters.

  • colClasses: You can pass a named character vector of classes.
  • na.strings: A character vector of strings that should be interpreted as NA. For example, by default the empty string "" is read as a character value, not as NA.
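As a minimal sketch of both parameters in action, here is a hypothetical two-column csv written to a temp file, where an empty field should become NA:

```r
library(data.table)

# Hypothetical example: a tiny csv with an empty field that should be NA
tmp <- tempfile(fileext = ".csv")
writeLines(c("ticker,close", "MELI,100.5", "LASR,"), tmp)

# colClasses maps column names to classes; na.strings turns "" into NA
dt_small <- fread(tmp,
                  colClasses = c(ticker = "character", close = "numeric"),
                  na.strings = "")
```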

Basic operations

1) Subsetting

1.1 Select the first 5 rows:

dt[1:5]

1.2 Select data of one ticker:

In this case I simply pass a boolean condition where ticker matches a string. Internally this builds a logical vector: TRUE for the rows that satisfy the condition and FALSE for the rows that don't, and only the TRUE rows are kept.

dt[ticker == 'MELI']
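A toy illustration of the logical vector behind this subsetting (the table here is made up):

```r
library(data.table)

# Toy table to show the logical mask that dt[ticker == 'MELI'] builds internally
toy <- data.table(ticker = c("MELI", "LASR", "MELI"), close = c(1, 2, 3))

mask <- toy$ticker == "MELI"   # TRUE, FALSE, TRUE
toy[mask]                      # same rows as toy[ticker == "MELI"]
```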

1.3 Understanding column classes

If we call str on the data.table we get a list of the columns and their classes (or data types). Here we can see date is being read as character, and the format seems to be year-month-day. The rest of the variables look correct to me.

str(dt)
## Classes 'data.table' and 'data.frame':   7880855 obs. of  8 variables:
##  $ date    : chr  "2018-04-26" "2018-04-27" "2018-04-30" "2018-05-01" ...
##  $ open    : num  23 26.2 25.6 25.2 27 ...
##  $ high    : num  28.2 26.7 26.4 27 29.8 ...
##  $ low     : num  22.2 24 24.9 25 26.9 ...
##  $ close   : num  27 25 24.9 26.9 29.1 ...
##  $ volume  : num  3906000 626300 229000 474500 456900 ...
##  $ adjusted: num  27 25 24.9 26.9 29.1 ...
##  $ ticker  : chr  "LASR" "LASR" "LASR" "LASR" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= atomic  
##   ..- attr(*, "__ticker")= int  566359 566360 566361 566362 566363 566364 566365 566366 566367 566368 ...

1.4 Coercing the class of a column by reference

The idea now is to change the class (data type) of the date column. To do this we will use the assignment-by-reference operator ( := ), which updates the column in place without copying the table.

dt[, date:=as.Date(date, format="%Y-%m-%d")]

str(dt)
## Classes 'data.table' and 'data.frame':   7880855 obs. of  8 variables:
##  $ date    : Date, format: "2018-04-26" "2018-04-27" ...
##  $ open    : num  23 26.2 25.6 25.2 27 ...
##  $ high    : num  28.2 26.7 26.4 27 29.8 ...
##  $ low     : num  22.2 24 24.9 25 26.9 ...
##  $ close   : num  27 25 24.9 26.9 29.1 ...
##  $ volume  : num  3906000 626300 229000 474500 456900 ...
##  $ adjusted: num  27 25 24.9 26.9 29.1 ...
##  $ ticker  : chr  "LASR" "LASR" "LASR" "LASR" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= atomic  
##   ..- attr(*, "__ticker")= int  566359 566360 566361 566362 566363 566364 566365 566366 566367 566368 ...

Now we can see the date is of Date class.
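A small sketch of the same coercion on a toy table, showing that := needs no reassignment back to the variable:

```r
library(data.table)

# Toy table with a character date column
toy <- data.table(date = c("2018-04-26", "2018-04-27"))

# := updates the column in place; no `toy <- ...` needed
toy[, date := as.Date(date, format = "%Y-%m-%d")]
class(toy$date)  # "Date"
```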

2. Counting

2.1 Counting by one group

This is probably one of the features I use the most, as it's often needed to understand the data. The .N symbol counts the number of rows in each group (in this case ticker). You can read the docs with ?".N". Using .N creates a column called N.

dt[, .N, ticker]

We can easily sort this result in decreasing order by putting a - sign before the N column that .N generates. I often chain this operation with another [] call.

dt[, .N, ticker][order(-N)]

Sometimes it's useful to get more observations printed. We can chain the result again, which gives a one-liner.

With the following command I get the top 10 groups.

dt[, .N, ticker][order(-N)][1:10]

It's often useful to save these results and plot a histogram, in this case to understand how many days of data are available for each ticker.

qdays = dt[, .N, ticker][order(-N)]
qdays[1:2]

The following plot shows the distribution of the number of days available per ticker. The spread suggests we should be wary of possible data issues. But as these are companies/ETFs that were listed at different points in time, we can't conclude from this alone whether the data has issues or not.

ggplot(qdays, aes(x=N)) + 
  geom_histogram() + 
  xlab("Amount of days available per ticker")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
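To silence that stat_bin() message, ggplot2 lets you set bins or binwidth explicitly in geom_histogram. A sketch with made-up counts:

```r
library(data.table)
library(ggplot2)

# Hypothetical counts; with the real data, qdays comes from dt[, .N, ticker]
qdays <- data.table(ticker = c("A", "B", "C"), N = c(100, 250, 250))

# Passing bins (or binwidth) explicitly avoids the stat_bin() message
p <- ggplot(qdays, aes(x = N)) +
  geom_histogram(bins = 50) +
  xlab("Amount of days available per ticker")
```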

2.2 Return calculation and lagged operations

data.table provides the shift function for this. We first need to sort the data.table, take the lag of the adjusted close price, and then compute the percent difference to the previous value.

There are multiple ways to sort a DT. For simplicity I often use order inside [ ], but there are other options such as setkey or setkeyv.

dt = dt[order(ticker, date)]
dt

We sort the table in ascending order as this is how the data gets generated.

dt[ticker == "MELI"]
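For comparison, a sketch of the setkey alternative mentioned above: it sorts the table by reference (no copy) and marks the key columns. The toy table here is made up:

```r
library(data.table)

toy <- data.table(ticker = c("B", "A"),
                  date = as.Date(c("2020-01-02", "2020-01-01")))

# setkey sorts in place by the given columns and records them as the key
setkey(toy, ticker, date)
key(toy)  # "ticker" "date"
```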

2.3 Create a lag column

I pass the adjusted variable to shift with n=1 and group by ticker. The idea is to create a one-day lag within each ticker group.

dt[, lag:=shift(adjusted, n=1), by=ticker]

Now the lag column looks like this for the "MELI" ticker, which is BTW a company similar to Amazon.

dt[ticker == "MELI"]

2.4 Return calculation

I can now compute the daily return of each asset. This shows how the stock is moving each day and will allow some calculations to be done in the next section.

dt[, R:=ifelse(lag == 0, 0, adjusted / lag - 1)]

I'll remove the rows with NA values (the first observation of each ticker) for simplicity:

dt = dt[complete.cases(dt)]
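A toy sketch of why exactly one row per ticker is dropped: shift introduces one NA at the start of each group, and complete.cases filters those rows out:

```r
library(data.table)

# Made-up prices: two tickers, two days each
toy <- data.table(ticker = rep(c("A", "B"), each = 2),
                  adjusted = c(10, 11, 20, 22))

toy[, lag := shift(adjusted, n = 1), by = ticker]   # first row per group is NA
toy[, R := ifelse(lag == 0, 0, adjusted / lag - 1)]

# complete.cases drops the NA rows, leaving one return per ticker
toy <- toy[complete.cases(toy)]
```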

3. Aggregations

3.1 Global values

We can start with a top-down approach and understand what is the return distribution of all available assets.

dt[, quantile(R, c(0.01, 0.25, 0.5, 0.75, 0.99))]
##          1%         25%         50%         75%         99% 
## -0.11111111 -0.01242236  0.00000000  0.01203080  0.13249998

The median daily return is zero across all assets, which makes sense for daily stock data.

3.2 Return at the ticker level

This is probably the second feature of DT I use the most: computing a function by group. In this case I'm grouping by ticker and applying a set of functions to the R variable. This returns a new object, which I called ret_agg, where each new column is the result of the group-by operation. Therefore the R_median column will have the median return of each ticker.

ret_agg = dt[, .(R_median = median(R), 
                 R_mean = mean(R), 
                 R_sd = sd(R), 
                 R_mad = mad(R)), 
             by=ticker]

ret_agg

It seems there are quite a few outliers in the data. The median looks like a better metric to get a broad idea of the return of an asset.

ggplot(ret_agg, aes(R_mean)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ret_agg, aes(R_median)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ret_agg, aes(R_mad)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

By using mad (median absolute deviation) and the median we get a better idea of market returns at the ticker level than with the mean or sd, since both are robust to outliers.
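A quick numeric sketch of that robustness: a single extreme value dominates sd but barely affects mad. The return values here are made up:

```r
# Hypothetical daily returns: 99 calm days plus one extreme day
x <- c(rep(0.01, 99), 5)

sd(x)    # dominated by the single outlier
mad(x)   # stays near zero, reflecting the typical day
```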

ggplot(ret_agg, aes(R_mad, R_median)) +
  geom_point() +
  ggtitle("Relation between MAD and median for each ticker")

4 Making the code more readable

The above is for demo purposes, but in practice I believe using small functions helps code readability.

To summarize some of the above operations I would create two functions:

# reading the data with fread and changing the date class to Date
read_data <- function(file_name){
  dt = fread(file_name)
  dt[, date:=as.Date(date, format="%Y-%m-%d")]
  dt
}


compute_returns <- function(dt){
  dt = dt[order(ticker, date)] # sorting
  dt[, lag:=shift(adjusted, n=1), by=ticker] # create a lag
  dt[, R:=ifelse(lag == 0, 0, adjusted / lag - 1)]  # compute the return
  dt[complete.cases(dt)] # remove rows with NA values
}

The following code is much clearer to follow, in my opinion. I first read the data:

dt = read_data("prices_nasdaq.csv")

Then I compute the returns:

dt = compute_returns(dt)